#knitr::opts_chunk$set(echo = TRUE)
library(survey) #Used for statistical analysis
library(sampling) #Used for stratified sampling
library(MASS) #Used for QDA
library(klaR) #Used for partimat
library(tree) #Used for the tree model
library(VGAM) #Used for Multinomial Regression model
library(caret) #Used for creating folds
library(ggplot2) #Used for plots
library(tibble) #Used to display data frames nicely
library(magrittr) #Used for %>%
library(tidyverse) #Used to manipulate data frames
library(plyr) #Used for mapping vector values (loaded before dplyr to avoid masking its verbs)
library(dplyr) #Used to manipulate rows and columns easily
library(stringr) #Used to manipulate string
library(gridExtra) #Used to arrange the ggplot chart
library(rpart) #Used for classification tree
library(rpart.plot) #Used for plotting classification tree
Mobile phones are everywhere, and so are their prices. Despite still having the word “phone” in the name, a typical modern smartphone does far more than make and receive calls. Phones boast a staggering range of features: brand, memory, storage, camera, resolution, just to name a few. All this cutting-edge technology packed into one little device does not come without a cost. A 2020 review of premium mobile phones shows a staggering 490% price rise over the last two decades.
With so many mobile phones on the market, it can be difficult to decide which one to buy. As customers, we are particularly interested in finding relations between all these features and the selling price. To this end, we collected the Mobile Phone dataset from Kaggle and applied a set of statistical analyses, hoping to answer some guiding questions:
The initial dataset consists of 8 columns and 28,036 rows with no missing values. These 8 columns are: Model, Company, Price, Rating, No_of_ratings, TotalReviwes, RamSize and RomSize.
Some initial steps can be completed to clean the dataset and create new variables which can be used in our analysis. The initial steps for cleaning the dataset are as follows:
mobile_dataset <- read.csv("./Updated_Mobile_Dataset.csv")
#Step 0: convert Model and Company letter to uppercase.
mobile_dataset$Model <- toupper(mobile_dataset$Model)
mobile_dataset$Company <- toupper(mobile_dataset$Company)
#Break down Model column into Model only and Color columns
#Step 1: remove the Company name from Model
Model_no_Company <- stringr::str_remove(mobile_dataset$Model,mobile_dataset$Company) %>% trimws(., which = c("both"))
#Step 2: remove anything after parentheses to get Model only info
mobile_dataset$Model_Only <- stringr::str_replace(Model_no_Company, " \\s*\\([^\\)]+\\)", "")
#Step 3: Get the color information from inside the last ()
#Step 3.1: Get the parenthesis and what is inside from the last ()
Model_no_Company_parenthesis <- stringr::str_extract(Model_no_Company, "(?<=\\()([^()]*?)(?=\\)[^()]*$)")
#step 3.2: Get the color by just retaining the info before ,
mobile_dataset$Color <- gsub(",.*$", "", Model_no_Company_parenthesis)
#Step 4: Remove duplicated rows in the dataset
mobile_dataset <- mobile_dataset[!duplicated(mobile_dataset),]
#Step 5: cut the price based on the percentile into 4 different levels
mobile_dataset <- mobile_dataset %>% mutate(Price_Level = ntile(Price, n = 4))
#Step 5.1: map each number level to the character
from <- c(1,2,3,4)
to <- c("Low", "Medium", "High", "Very High")
mobile_dataset$Price_Level <- mapvalues(mobile_dataset$Price_Level, from = from, to = to)
#Step 6: Get the numeric part of RomSize (strip GB/MB, converting MB to GB); discard any record with no numeric value in RomSize
#Step 6.1: there are some data entry errors for RamSize and RomSize: in records where RomSize is "Not Known", the Rom size was recorded in RamSize, so we move it over and set RamSize to "0 GB"
RamSize_temp <- ifelse(mobile_dataset$RomSize == "Not Known", "0 GB", mobile_dataset$RamSize)
mobile_dataset$RomSize <- ifelse(mobile_dataset$RomSize == "Not Known", mobile_dataset$RamSize, mobile_dataset$RomSize)
mobile_dataset$RamSize <- RamSize_temp
#Step 6.2: split RomSize into two columns with size number and unit, and convert MB to 1/1000GB, KB to 1/1000000GB
mobile_dataset$RamSize_Ori <- mobile_dataset$RamSize
mobile_dataset$RomSize_Ori <- mobile_dataset$RomSize
mobile_dataset <- mobile_dataset %>% separate(RomSize, c("RomSize_num", "RomSize_Unit")) %>% mutate(RomSize_Unit= mapvalues(.$RomSize_Unit, from = c("GB", "MB", "KB"), to = c(1, 1/1000, 1/1000000)))
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [573].
#step 6.3: remove any rows that are not numeric value for RomSize
mobile_dataset <- mobile_dataset[!is.na(as.numeric(mobile_dataset$RomSize_num)),]
## Warning in
## `[.data.frame`(mobile_dataset, !is.na(as.numeric(mobile_dataset$RomSize_num)), :
## NAs introduced by coercion
mobile_dataset$RomSize_num <- as.numeric(mobile_dataset$RomSize_num)
mobile_dataset$RomSize_Unit <- ifelse(is.na(as.numeric(mobile_dataset$RomSize_Unit)), 0, as.numeric(mobile_dataset$RomSize_Unit))
#Step 6.4: generate the final column RomSize_inGB
mobile_dataset$RomSize_inGB <- mobile_dataset$RomSize_num * mobile_dataset$RomSize_Unit
#Step 7: Get the numeric part of RamSize (strip GB/MB, converting MB to GB); discard any record with no numeric value in RamSize
#Step 7.1: split RamSize into two columns with size number and unit, and convert MB to 1/1000GB
mobile_dataset <- mobile_dataset %>% separate(RamSize, c("RamSize_num", "RamSize_Unit")) %>% mutate(RamSize_Unit= mapvalues(.$RamSize_Unit, from = c("GB", "MB"), to = c(1, 1/1000)))
#step 7.2: remove any rows that are not numeric value for RamSize
mobile_dataset <- mobile_dataset[!is.na(as.numeric(mobile_dataset$RamSize_num)),]
mobile_dataset$RamSize_num<- as.numeric(mobile_dataset$RamSize_num)
mobile_dataset$RamSize_Unit <- as.numeric(mobile_dataset$RamSize_Unit)
#Step 7.3: generate the final column RamSize_inGB
mobile_dataset$RamSize_inGB <- mobile_dataset$RamSize_num * mobile_dataset$RamSize_Unit
#Step 8: Create a new column to determine if the phone is 5G or not
mobile_dataset$Is_5G <- ifelse(str_detect(mobile_dataset$Model_Only, "5G"), "Yes", "No")
#Step 9: only keep the columns we need
column_names <- c("Model", "Company", "Price", "Rating", "No_of_ratings", "TotalReviwes", "Model_Only", "Color", "Price_Level", "RamSize_inGB", "RomSize_inGB", "RamSize_Ori", "RomSize_Ori", "Is_5G" )
mobile_dataset <- mobile_dataset[column_names]
write.csv(mobile_dataset, file = './Cleaned_Mobile_Dataset.csv', row.names = F)
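As a quick sanity check, the Model-parsing logic from Steps 1–3 can be exercised on a single toy record (the model string below is hypothetical, not from the dataset); this base-R sketch mirrors the stringr/gsub calls above:

```r
# Hypothetical example record: company name, model name, then "(Color, Storage)"
model <- "APPLE IPHONE 12 (BLACK, 128 GB)"
company <- "APPLE"

# Step 1: drop the company name and trim surrounding whitespace
no_company <- trimws(sub(company, "", model, fixed = TRUE))

# Step 2: drop the trailing parenthetical to keep the model name only
model_only <- sub(" \\s*\\([^)]+\\)", "", no_company)

# Step 3: grab the contents of the last (...) and keep the text before the comma
inside <- regmatches(no_company,
                     regexpr("(?<=\\()([^()]*?)(?=\\)[^()]*$)", no_company, perl = TRUE))
color <- sub(",.*$", "", inside)
```

On this example, `model_only` comes out as "IPHONE 12" and `color` as "BLACK", matching what Steps 1–3 produce for real records.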
After cleaning and breaking down columns, the dataset now consists of 14 columns and 725 rows with no missing values. These 14 columns are: Model, Company, Price, Rating, No_of_ratings, TotalReviwes, Model_Only, Color, Price_Level, RamSize_inGB, RomSize_inGB, RamSize_Ori, RomSize_Ori and Is_5G.
#mobile_dataset <- as_tibble(read.csv("./Cleaned_Mobile_Dataset.csv"))
#Specify the level for Price_Level column
mobile_dataset$Price_Level <- factor(mobile_dataset$Price_Level, levels = c("Low", "Medium", "High", "Very High"))
head(mobile_dataset, 4)
Table 1: The cleaned-up dataset for Mobile Phone
The dataset and detailed analysis can be found at this repository.
Our team is finalizing what the full analysis of the dataset will look like, but a preliminary template and breakdown of work by team member has been included below. The different colors represent which components of the project different team members would take on. It is anticipated that all members will assist in the finalization of the report.
The dimension of the cleaned dataset is 725 rows and 14 columns. A summary of the data is as follows:
mobile_dataset <- as_tibble(read.csv("./Cleaned_Mobile_Dataset.csv"))
#Specify the level for Price_Level column
mobile_dataset$Price_Level <- factor(mobile_dataset$Price_Level, levels = c("Low", "Medium", "High", "Very High"))
mobile_dataset %>% summary()
## Model Company Price Rating
## Length:725 Length:725 Min. : 698 Min. :2.700
## Class :character Class :character 1st Qu.: 7014 1st Qu.:4.100
## Mode :character Mode :character Median : 12889 Median :4.200
## Mean : 15903 Mean :4.227
## 3rd Qu.: 16999 3rd Qu.:4.300
## Max. :149900 Max. :4.800
## No_of_ratings TotalReviwes Model_Only Color
## Min. : 3 Min. : 0 Length:725 Length:725
## 1st Qu.: 874 1st Qu.: 80 Class :character Class :character
## Median : 7325 Median : 670 Mode :character Mode :character
## Mean : 34522 Mean : 2496
## 3rd Qu.: 37211 3rd Qu.: 2972
## Max. :575907 Max. :33942
## Price_Level RamSize_inGB RomSize_inGB RamSize_Ori
## Low :171 Min. : 0.000 Min. : 0.00 Length:725
## Medium :185 1st Qu.: 0.064 1st Qu.: 32.00 Class :character
## High :185 Median : 4.000 Median : 64.00 Mode :character
## Very High:184 Mean : 3.882 Mean : 81.78
## 3rd Qu.: 6.000 3rd Qu.:128.00
## Max. :12.000 Max. :512.00
## RomSize_Ori Is_5G
## Length:725 Length:725
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Table 2: A summary for Mobile Phone dataset
This section is focused on the exploration of relation between variables using various visualization techniques.
The main target in the dataset is Price. The following visualization shows the distribution of mobile phone prices, with a long tail toward the high end. The average price is 15,902 Rupees, and most prices fall below 50,000 Rupees. One price range that really stands out (arrow) is 1,000-2,000 Rupees, which has the highest frequency.
ggplot(data = mobile_dataset, mapping = aes(x=Price)) + geom_histogram(color="black", fill="white", bins = 100)+
geom_vline(aes(xintercept=mean(Price)),
color="blue", linetype="dashed", linewidth=0.4) + geom_label(
label="Mean Price = 15902",
x= mean(mobile_dataset$Price) + 10000,
y=100,
label.padding = unit(0.35, "lines"),
label.size = 0.25,
color = "black",
fill="#69b3a2"
) + geom_segment(aes(x = -5000, y = 140, xend = 0, yend = 130),lineend = "round",linejoin = "round",
arrow = arrow(length = unit(0.5, "cm"))) + labs(y="Count", x="Price in Rupees",
subtitle="Distribution of Mobile Phone price")
The following figure breaks down Price by phone-making company. There are a total of 37 companies in the dataset. Consistent with the previous figure on the Price distribution, a large portion of prices falls between roughly 9,000 and 40,000 Rupees (the dashed lines mark reference prices). Apple alone contributes most of the high-priced mobile phones, while companies from NOKIA down to GFIVE contribute the low-priced ones.
ggplot(data = mobile_dataset, mapping = aes(y = reorder(Company, Price), x = Price)) +geom_boxplot()+ geom_line() +
sapply(c(6499, 12490, 20000), function(xint) geom_vline(aes(xintercept = xint), color="blue", linetype="dashed", linewidth=0.4)) + labs(y="Company", x="Price in Rupees",
subtitle="Boxplot for Mobile Phone price from different companies")
The price is further broken down by 5G service. Interestingly, neither the highest-priced nor the lowest-priced mobile phones are equipped with 5G; in contrast, almost all middle-priced mobile phones have 5G service. Prices for phones with 5G service are slightly higher than for those without. NOTE: the dashed lines mark 9,000 (left) and 40,000 (right) Rupees.
ggplot(data = mobile_dataset, mapping = aes(y = reorder(Company, Price), x = Price, color = Is_5G)) +geom_boxplot(position=position_dodge(width = 0.8))+ geom_line() +
sapply(c(9000, 40000), function(xint) geom_vline(aes(xintercept = xint), color="blue", linetype="dashed", linewidth=0.4)) + labs(y="Company", x="Price in Rupees",
subtitle="Boxplot for Mobile Phone price from different companies with 5G service (No / Yes)") + facet_wrap(~Is_5G)
The pattern is more obvious when we color by the 4 price levels (“Low”, “Medium”, “High”, “Very High”). Although mobile phones from companies like APPLE, GOOGLE, ASUS and NOTHING have no 5G service, their prices are all very high. It is worth investigating whether other attributes, such as brand effect, storage capacity (Rom size) and computing power (Ram size), are playing a role. Most mobile phones with 5G service (except MI), again, have a higher proportion falling in the Very High price level compared to their counterparts without 5G service. Almost all the remaining mobile phones without 5G service fall in the Low price level.
ggplot(data = mobile_dataset,mapping = aes(y =reorder(Company, Price), fill = Price_Level)) + geom_bar(stat = "count", position="fill") + labs(y="Company", x="Percentage",
subtitle="Percent stacked bar chart of Mobile Phone price levels from different companies with 5G service (No / Yes)") + facet_wrap(~Is_5G)
Next, we explored some quantitative features with respect to the mobile phone price. The first scatter plot shows an obvious positive relation between Rating and Price, as the smooth line indicates: the higher the Rating, the higher the Price. This holds even when we remove the 2 outlier data points at the bottom left.
#The first plot is one with outlier
plot1 = ggplot(data = mobile_dataset,mapping = aes(x = Rating, y = Price)) + geom_point(alpha=0.3) + geom_smooth(se=T) +
geom_segment(aes(x = 2.6, y = 20000, xend = 2.67, yend = 10000),lineend = "round",linejoin = "round",arrow = arrow(length = unit(0.5, "cm"))) +
geom_point(data=mobile_dataset[mobile_dataset$Rating<3.0,], aes(x = Rating, y = Price), color='red') +
labs(y="Price in Rupees", x="Rating", subtitle="Relation between rating and price")
#The second plot has the bottom-left outliers removed
plot2 = ggplot(data = mobile_dataset[mobile_dataset$Rating>3.0, ],mapping = aes(x = Rating, y = Price)) + geom_point(alpha=0.3) + geom_smooth(se=T) +
labs(y="Price in Rupees", x="Rating", subtitle="Relation between rating and price after removing \nbottom left 2 outliers")
grid.arrange(plot1, plot2, ncol=2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The second scatter plot shows the relation between the total number of ratings and Price. As the smooth line indicates, there is no obvious correlation between them. The result holds even when we remove some apparent outliers on the far right.
ggplot(data = mobile_dataset,mapping = aes(x = No_of_ratings, y = Price)) + geom_point(alpha=0.3) + geom_smooth()+ labs(y="Price in Rupees", x="Total number of ratings", subtitle="Relation between total number of ratings and price")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The same pattern is observed in the third scatter plot, which plots the relation between the total number of reviews and Price. As the smooth line indicates, there is no obvious correlation between them. The result holds even when we remove some apparent outliers on the far right.
ggplot(data = mobile_dataset,mapping = aes(x = TotalReviwes, y = Price)) + geom_point(alpha=0.3) + geom_smooth()+ labs(y="Price in Rupees", x="Total number of reviews", subtitle="Relation between total number of reviews and price")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In the fourth scatter plot, below, there seems to be a slight positive relation between Ram size (in GB) and Price. Ram is normally associated with the speed and responsiveness of the operating system: the more Ram, the better the performance, so it makes sense that a mobile phone with more Ram costs more. However, because most mobile phones nowadays are fast and stable, people won't notice much difference, so the added value from Ram is only weakly associated with price.
plot1 = ggplot(data = mobile_dataset,mapping = aes(x = RamSize_inGB, y = Price)) + geom_point(alpha=0.3) + geom_smooth(se=T)+ labs(y="Price in Rupees", x="Ram Size in GB", subtitle="Relation between Ram size (in GB) and price")
plot2 = ggplot(data = mobile_dataset,mapping = aes(x = RomSize_inGB, y = Price)) + geom_point(alpha=0.3) + geom_smooth(se=T, method = lm)+
geom_segment(aes(x = 490, y = 100000, xend = 500, yend = 90000),lineend = "round",linejoin = "round",arrow = arrow(length = unit(0.5, "cm"))) +
geom_point(data=mobile_dataset[mobile_dataset$RomSize_inGB>400,], aes(x = RomSize_inGB, y = Price), color='red') +
labs(y="Price in Rupees", x="Rom Size in GB", subtitle="Relation between Rom size (in GB) and price")
grid.arrange(plot1, plot2, ncol=2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
In the final scatter plot we again have two charts, one with the original dataset and the other with the outliers removed. Both tell the same story. In contrast to the previous scatter plot (Ram size vs. Price), there is a very obvious positive linear relation between Rom size (in GB) and Price. This makes sense because a mobile phone is often more limited by its non-volatile storage space than by its speed and performance.
#The first plot is the one with outliers
plot1 = ggplot(data = mobile_dataset,mapping = aes(x = RomSize_inGB, y = Price)) + geom_point(alpha=0.3) + geom_smooth(se=T, method = lm)+
geom_segment(aes(x = 490, y = 100000, xend = 500, yend = 90000),lineend = "round",linejoin = "round",arrow = arrow(length = unit(0.5, "cm"))) +
geom_point(data=mobile_dataset[mobile_dataset$RomSize_inGB>400,], aes(x = RomSize_inGB, y = Price), color='red') +
labs(y="Price in Rupees", x="Rom Size in GB", subtitle="Relation between Rom size (in GB) and price")
#The second plot is a one with outliers removed
plot2 = ggplot(data = mobile_dataset[mobile_dataset$RomSize_inGB<400, ],mapping = aes(x = RomSize_inGB, y = Price)) + geom_point(alpha=0.3) + geom_smooth(se=T,method = lm) +
labs(y="Price in Rupees", x="Rom Size in GB", subtitle="Relation between Rom size (in GB) and price")
grid.arrange(plot1, plot2, ncol=2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
In this section, we use a range of statistical models to predict the mobile phone price level (“Low”, “Medium”, “High”, “Very High”) based on selected features.
First, let us explore the distribution of the different price levels. The following bar plot shows that they are very close to uniformly distributed across the dataset. A balanced dataset is important for classification.
price_level_count <- table(mobile_dataset$Price_Level) %>% as.data.frame()
ggplot(data = price_level_count, mapping = aes(x = Var1, y = Freq)) + geom_bar(stat = "identity") +
labs(y="Count", x="Price Level", subtitle="Count for different Price Levels")
### 2.1.1 Linear Discriminant Analysis (LDA)

We want to use LDA to classify our observations into the mobile phone price levels “Low”, “Medium”, “High” and “Very High”. However, LDA has 2 assumptions: the predictors are normally distributed within each class, and all classes share a common covariance matrix.
Based on the EDA in section 4.2, we know three numeric features have a great impact on the price level a mobile phone belongs to: Rating, RamSize_inGB and RomSize_inGB. As such, we test each predictor for normality within each price level. Because we have 4 price levels and 3 variables, the normality check involves 4*3 = 12 comparisons. We use Q-Q plots and the Shapiro–Wilk test for the normality check.
#split the dataset based on the price level
mobile_dataset_low <- mobile_dataset[mobile_dataset$Price_Level == "Low", ]
mobile_dataset_medium <- mobile_dataset[mobile_dataset$Price_Level == "Medium", ]
mobile_dataset_high <- mobile_dataset[mobile_dataset$Price_Level == "High", ]
mobile_dataset_very_high <- mobile_dataset[mobile_dataset$Price_Level == "Very High", ]
#generate the QQ plot for each price level
variables <- c("Rating", "RamSize_inGB", "RomSize_inGB")
par(mfrow = c(1,3))
for (i in variables){
qqnorm(mobile_dataset_low[[i]], main = paste("Price Level LOW, Normal Q-Q Plot for", i))
qqline(mobile_dataset_low[[i]], col = 3)
}
par(mfrow = c(1,3))
for (i in variables){
qqnorm(mobile_dataset_medium[[i]], main = paste("Price Level Medium, Normal Q-Q Plot for", i))
qqline(mobile_dataset_medium[[i]], col = 3)
}
par(mfrow = c(1,3))
for (i in variables){
qqnorm(mobile_dataset_high[[i]], main = paste("Price Level High, Normal Q-Q Plot for", i))
qqline(mobile_dataset_high[[i]], col = 3)
}
par(mfrow = c(1,3))
for (i in variables){
qqnorm(mobile_dataset_very_high[[i]], main = paste("Price Level Very High, Normal Q-Q Plot for", i))
qqline(mobile_dataset_very_high[[i]], col = 3)
}
#We also run the Shapiro-Wilk test to statistically confirm the result
p_value_rating <- c(shapiro.test(mobile_dataset_low$Rating)$p.value, shapiro.test(mobile_dataset_medium$Rating)$p.value,
shapiro.test(mobile_dataset_high$Rating)$p.value,
shapiro.test(mobile_dataset_very_high$Rating)$p.value)
p_value_RamSize_inGB <- c(shapiro.test(mobile_dataset_low$RamSize_inGB)$p.value, shapiro.test(mobile_dataset_medium$RamSize_inGB)$p.value,
shapiro.test(mobile_dataset_high$RamSize_inGB)$p.value,
shapiro.test(mobile_dataset_very_high$RamSize_inGB)$p.value)
p_value_RomSize_inGB <- c(shapiro.test(mobile_dataset_low$RomSize_inGB)$p.value, shapiro.test(mobile_dataset_medium$RomSize_inGB)$p.value,
shapiro.test(mobile_dataset_high$RomSize_inGB)$p.value,
shapiro.test(mobile_dataset_very_high$RomSize_inGB)$p.value)
p_value_df <- data.frame(p_value_rating,p_value_RamSize_inGB,p_value_RomSize_inGB)
rownames(p_value_df) <- c("Low", "Medium", "High", "Very High")
p_value_df
Based on the Q-Q plots (the points do not fall on the line) and the Shapiro–Wilk tests (p-values < 5%), we conclude that none of the variables is normally distributed within the price levels. Therefore, we will not proceed with the LDA model. Although Quadratic Discriminant Analysis (QDA) also assumes within-class normality, we still try QDA (in the next section) since it drops the equal-covariance assumption. NOTE: the above code is inspired by Pascal Schmidt’s post Assumption Checking of LDA vs. QDA – R Tutorial Pima Indians Data Set.
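To illustrate the Shapiro–Wilk decision rule used above, here is a small self-contained sketch on simulated data (sample size and seed are arbitrary choices, not from our dataset): a small p-value (< 5%) leads us to reject normality, exactly as we did for our three predictors.

```r
set.seed(42)
normal_sample <- rnorm(200)  # drawn from a normal distribution
skewed_sample <- rexp(200)   # drawn from a right-skewed distribution

p_norm <- shapiro.test(normal_sample)$p.value   # typically large: normality not rejected
p_skew <- shapiro.test(skewed_sample)$p.value   # essentially zero: normality rejected
```

The skewed sample produces a p-value far below 5%, mirroring the rejections we saw for Rating, RamSize_inGB and RomSize_inGB within each price level.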
Consistent with the LDA analysis, we use the same three numeric features that have a great impact on the price level: Rating, RamSize_inGB and RomSize_inGB. We split the dataset into 75% for training and 25% for testing, using each “Price_Level” as a stratum and the proportional allocation principle to obtain a stratified sample. Then we fit a Quadratic Discriminant Analysis (QDA) model as follows:
set.seed(2023)
Ny <- round(table(mobile_dataset$Price_Level)[unique(mobile_dataset$Price_Level)]* 0.75)
#Create the sample from the population
idx=sampling::strata(mobile_dataset, stratanames=c("Price_Level"), size=Ny, method="srswor")
strata_sample <- getdata(mobile_dataset,idx)
train <- mobile_dataset[idx$ID_unit,]
test <- mobile_dataset[-idx$ID_unit,]
#Train the data using Quadratic Discriminant Analysis
Model.fit<-qda(Price_Level~Rating+RamSize_inGB+RomSize_inGB, data = train)
Model.fit
cat(rep("-", 50), "\n")
#Optional: partition plot of the QDA decision regions for each pair of predictors (commented out for brevity)
# partimat(Price_Level~Rating+RamSize_inGB+RomSize_inGB, data = train, method="qda")
## Call:
## qda(Price_Level ~ Rating + RamSize_inGB + RomSize_inGB, data = train)
##
## Prior probabilities of groups:
## Low Medium High Very High
## 0.2352941 0.2555147 0.2555147 0.2536765
##
## Group means:
## Rating RamSize_inGB RomSize_inGB
## Low 4.012500 0.3342734 5.949688
## Medium 4.237410 3.8201439 62.848921
## High 4.276259 5.6258993 113.266187
## Very High 4.341304 5.5797536 142.376812
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Finally, we apply this model to the testing dataset and calculate the misclassification rate.
Prob.predict<-predict(Model.fit,test,type="response")
ggplot(as.data.frame(table(Prob.predict$class,test$Price_Level)), aes(x = Var1, y = Var2, fill = Freq)) +geom_tile() + geom_text(aes(label = round(Freq, 1))) + scale_fill_gradient(low = "white", high = "red") + labs(y="True", x="Predict", subtitle="Confusion Matrix for QDA")
cat(rep("-", 45), "\n")
cat("The misclassification rate for QDA is: ", mean(Prob.predict$class != test$Price_Level))
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
## The misclassification rate for QDA is: 0.2265193
Conclusion: Quadratic Discriminant Analysis predicts the mobile phone price level well using only 3 variables (Rating, RamSize_inGB and RomSize_inGB), with a relatively high accuracy (77%).
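Throughout these conclusions, accuracy is simply one minus the misclassification rate. A tiny self-contained sketch (the labels here are made up for illustration, not taken from our data) of how both quantities are computed from a vector of predictions:

```r
predicted <- c("Low", "High", "Medium", "Low")   # hypothetical predictions
truth     <- c("Low", "High", "High",  "Low")    # hypothetical true labels

misclass <- mean(predicted != truth)  # fraction of wrong predictions
accuracy <- 1 - misclass              # fraction of correct predictions
```

With one mistake out of four, the misclassification rate is 0.25 and the accuracy 0.75, matching the 0.2265 vs. 77% pairing reported for QDA above.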
To be consistent and comparable with QDA, we again use the three variables Rating, RamSize_inGB and RomSize_inGB to build the decision tree model. We split the dataset into 75% for training and 25% for testing, using each “Price_Level” as a stratum and the proportional allocation principle to obtain a stratified sample. Then we fit a classification tree as follows:
set.seed(2023)
Ny <- round(table(mobile_dataset$Price_Level)[unique(mobile_dataset$Price_Level)]* 0.75)
#Create the sample from the population
idx=sampling::strata(mobile_dataset, stratanames=c("Price_Level"), size=Ny, method="srswor")
strata_sample <- getdata(mobile_dataset,idx)
train <- mobile_dataset[idx$ID_unit,]
test <- mobile_dataset[-idx$ID_unit,]
#Fit the train data into the tree model
tree.class<-tree(factor(Price_Level)~Rating+RamSize_inGB+RomSize_inGB, train)
summary(tree.class)
##
## Classification tree:
## tree(formula = factor(Price_Level) ~ Rating + RamSize_inGB +
## RomSize_inGB, data = train)
## Variables actually used in tree construction:
## [1] "RomSize_inGB" "RamSize_inGB"
## Number of terminal nodes: 9
## Residual mean deviance: 1.055 = 564.3 / 535
## Misclassification error rate: 0.2335 = 127 / 544
The tree is plotted without pruning as follows. There are a total of 9 terminal nodes in the tree.
#Let us plot the tree as well
plot(tree.class)
text(tree.class,pretty=0)
# rpart_fit <- rpart(factor(Price_Level)~Rating+RamSize_inGB+RomSize_inGB, train, method = 'class')
# rpart.plot(rpart_fit)
To find the best tree size, we use cross-validation to select the number of terminal nodes. Based on the following plot, tree size 9 gives the smallest deviance. To demonstrate how a pruned tree works, and to reduce the complexity of the tree, we choose 8 as the “best” tree size.
cv.class<-cv.tree(tree.class, FUN = prune.misclass)
plot(cv.class$size, cv.class$dev,type="b")
A tree after pruning is plotted as follows. There are a total of 8 terminal nodes in the tree.
prune.class=prune.tree(tree.class,best=8)
plot(prune.class)
text(prune.class,pretty=0)
Finally, we use the pruned tree to do prediction.
prune.pred=predict(prune.class,test,type="class")
ggplot(as.data.frame(table(prune.pred,test$Price_Level)), aes(x = prune.pred, y = Var2, fill = Freq)) +geom_tile() + geom_text(aes(label = round(Freq, 1))) + scale_fill_gradient(low = "white", high = "red") + labs(y="True", x="Predict", subtitle="Confusion Matrix for Classification Tree")
cat(rep("-", 45), "\n")
cat("The misclassification rate for the classification tree is: ", mean(prune.pred != test$Price_Level))
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
## The misclassification rate for the classification tree is: 0.2375691
Conclusion: the classification tree also predicts the mobile phone price level well using only 3 variables (Rating, RamSize_inGB and RomSize_inGB), with a relatively high accuracy (76.2%).
This is an experimental analysis. Based on the EDA in section 4.2, we know that the company name (branding) plays a big role in the price level, and that the Rom size of a mobile phone matters a lot as well. We are curious to see whether a combination of Company and RomSize can also be used to distinguish the price level.
Normally, we would split the dataset into 75% for training and 25% for testing. However, because our sample size is small, not every company has multiple records in our dataset. Therefore, in the following test, we use the whole sample for both training and testing.
Then we fit the data into a multinomial regression as follows. Note that because our response is ordinal, we use the cumulative logit model; setting parallel = TRUE constrains the slope coefficients (beta) to be equal across the cumulative logits, giving a simpler proportional-odds model.
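Concretely, with the price levels ordered Low < Medium < High < Very High, the cumulative logit model with parallel = TRUE fits

\[ \log\frac{P(Y \le j)}{P(Y > j)} = \alpha_j + \boldsymbol{\beta}^\top \mathbf{x}, \qquad j = 1, 2, 3, \]

where the intercepts \(\alpha_j\) differ across the three cumulative logits but the slope vector \(\boldsymbol{\beta}\) is shared; this common slope is the simplification that parallel = TRUE enforces in vglm.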
We also use a chi-square test for the goodness of fit of this model: \[ H_0: \text{The Cumulative logit model fits the observations} \] \[ H_1: \text{The Cumulative logit model does not fit the observations} \]
company_rom_train <- mobile_dataset[c("Company", "RomSize_Ori", "Price_Level")]
multi_train_fit <- vglm(Price_Level~Company+RomSize_Ori,family=cumulative(parallel = TRUE),data=company_rom_train)
cat("The p value for the chi-square test is: ", 1-pchisq(deviance(multi_train_fit),df.residual(multi_train_fit)))
cat("\nThe p-value here is large (= 1), so we cannot reject the null hypothesis. Our model seems to provide a good fit.")
## The p value for the chi-square test is: 1
## The p-value here is large (= 1), so we cannot reject the null hypothesis. Our model seems to provide a good fit.
Finally, for demonstration purposes, we use this model to predict on the same training dataset.
pred_fit <- predict(multi_train_fit, type="response")
#Use the class with the highest probability as the final prediction for each row.
pred_fit_class <-factor(dimnames(pred_fit)[[2]][max.col(pred_fit, 'first')], levels = c("Low", "Medium", "High", "Very High"))
ggplot(as.data.frame(table(pred_fit_class, mobile_dataset$Price_Level)), aes(x = pred_fit_class, y = Var2, fill = Freq)) +geom_tile() + geom_text(aes(label = round(Freq, 1))) + scale_fill_gradient(low = "white", high = "red") + labs(y="True", x="Predict", subtitle="Confusion Matrix for Multinomial Regression")
cat(rep("-", 45), "\n")
cat("The misclassification rate for Multinomial Regression is: ", mean(pred_fit_class != mobile_dataset$Price_Level))
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
## The misclassification rate for Multinomial Regression is: 0.2813793
Conclusion: Multinomial Regression also performs well in predicting the mobile phone price level using only 2 variables (Company and RomSize), with relatively high accuracy (71.9%).
In Section 5.1, we compared three different models based on their misclassification rates: Quadratic Discriminant Analysis (QDA) = 0.2265193, Decision Tree (Classification) = 0.2375691, and Multinomial Regression = 0.2813793. To compare model performance more reliably, we also use stratified \(10\)-fold cross-validation to calculate the cross-validation errors of these methods.
cat("\nFirst I am using the 10-fold cross validation for Quadratic Discriminant Analysis")
set.seed(2023)
#Create 10 folds
folds<-createFolds(mobile_dataset$Price_Level, k=10)
#Define a function to calculate the misclassification rate for each fold
misrate_fold<-function(idx){
Train<-mobile_dataset[-idx,]
Test<-mobile_dataset[idx,]
fit <- qda(Price_Level~Rating+RamSize_inGB+RomSize_inGB, data = Train)
pred<-predict(fit,Test)
return(1- mean(pred$class == Test$Price_Level))
}
#Get the misclassification rate from each fold
type_misrate_fold=lapply(folds,misrate_fold)
#The average of misclassification rate for 10 folds is:
cat("\nThe average of misclassification rate for 10 folds
cross validation of Quadratic Discriminant Analysis is :",
mean(as.numeric(type_misrate_fold)))
cat("\n")
cat(rep("-", 45), "\n")
cat("\nSecond I am using the 10-fold cross validation for Decision Tree (Classification)")
set.seed(2023)
#Create 10 folds
folds<-createFolds(mobile_dataset$Price_Level, k=10)
#Define a function to calculate the misclassification rate for each fold
misrate_fold<-function(idx){
Train<-mobile_dataset[-idx,]
Test<-mobile_dataset[idx,]
fit <- tree(factor(Price_Level)~Rating+RamSize_inGB+RomSize_inGB, data = Train)
pred<-predict(fit,Test, type = 'class')
return(1- mean(pred == Test$Price_Level))
}
#Get the misclassification rate from each fold
type_misrate_fold=lapply(folds,misrate_fold)
#The average of misclassification rate for 10 folds is:
cat("\nThe average of misclassification rate for 10 folds cross
validation of Decision Tree (Classification) is :", mean(as.numeric(type_misrate_fold)))
cat("\n")
cat(rep("-", 45), "\n")
cat("\nBecause our sample size is small, not every Company has multiple records in our dataset. \nTherefore, we are not able to perform cross-validation for Multinomial regression.")
##
## First I am using the 10-fold cross validation for Quadratic Discriminant Analysis
## The average of misclassification rate for 10 folds
## cross validation of Quadratic Discriminant Analysis is : 0.2690109
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
##
## Second I am using the 10-fold cross validation for Decision Tree (Classification)
## The average of misclassification rate for 10 folds cross
## validation of Decision Tree (Classification) is : 0.2314057
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
##
## Because our sample size is small, not every Company has multiple records in our dataset.
## Therefore, we are not able to perform cross-validation for Multinomial regression.
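The limitation above can be demonstrated with a small self-contained sketch (the company counts here are illustrative, not the actual mobile_dataset): a company appearing fewer than 10 times cannot contribute a record to every fold.

```r
# Find companies too rare to appear in each of k = 10 folds
company <- c(rep("APPLE", 12), rep("SAMSUNG", 15), rep("GOOGLE", 2))
counts <- table(company)
rare <- names(counts)[counts < 10]
rare  # "GOOGLE"
```

On the real data, table(mobile_dataset$Company) would identify the companies blocking cross-validation for the multinomial model.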
To summarize this part of the analysis, we find somewhat contradictory results. We used the balanced dataset to classify the Mobile Phone Price Level. In Section 5.1, we obtained the following misclassification rates: QDA = 0.2265193, Decision Tree (Classification) = 0.2375691, and Multinomial Regression = 0.2813793.
We can see that Quadratic Discriminant Analysis (QDA) gives the lowest misclassification rate, which would make it the best model. However, the 10-fold cross-validation gives the following misclassification rates: QDA = 0.2690109 and Decision Tree (Classification) = 0.2314057.
By this result, Decision Tree (Classification) gives the lowest misclassification rate. Because cross-validation averages the misclassification rate over multiple folds, we consider this result more reliable.
So our final conclusion is that the Decision Tree (Classification) best predicts the Mobile Phone Price Level using only 3 variables (Rating, RamSize_inGB and RomSize_inGB), with relatively high accuracy (76.9%).
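The two comparisons can be collected into one small table; a sketch using the rates reported in this section:

```r
# Misclassification rates: single train/test evaluation vs 10-fold CV
comparison <- data.frame(
  Model        = c("QDA", "Classification Tree", "Multinomial Regression"),
  Single_split = c(0.2265193, 0.2375691, 0.2813793),
  CV_10_fold   = c(0.2690109, 0.2314057, NA)  # CV not possible for multinomial
)
best <- comparison$Model[which.min(comparison$CV_10_fold)]
best  # "Classification Tree"
```

which.min ignores the NA entry, so the model selected is the one with the lowest cross-validated error among those we could cross-validate.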
The goal of this section is to look at categorization of phones that fall within the most expensive bracket. Among these phones are Apple, Google, Samsung, Vivo and Iqoo. Initially, we wanted to categorize Apple, Google and Samsung, but the small number of Google entries in the dataset prevented us from using them (we were unable to split Google into testing and training sets). For this reason, we included Vivo and Iqoo instead. For categorization, we chose to compare five different classification methods: LDA, QDA, multinomial logistic regression, a Naive Bayes classifier, and KNN.
The first step prior to building the different classifiers was to read the data into R.
cellphone = mobile_dataset
With the dataset in R, we can reduce our data frame to contain only the companies on which we want to perform classification.
cellphone = cellphone[cellphone$Company %in% c("APPLE", "SAMSUNG", "VIVO", "IQOO"), ]
cellphone = cellphone %>% drop_na()
The next step is to choose which predictors we want to use for the analysis. After experimenting with the different attributes in the dataset, the rating, Price, number of ratings, total reviews and ROM size were chosen as the variables used to classify Company. Color was also considered, but since there are so many unique colors, the test set would always contain a color not seen in training and the model would fail. With some possible exceptions, the color of a phone is likely not proprietary to the company and thus is not crucial for categorization.
df2 <- cellphone[, (names(cellphone) %in% c("Rating","Price","No_of_ratings","TotalReviwes","RomSize_inGB", "Company"))]
In order to test some of the assumptions required to use the aforementioned classification methods, the order of the columns is changed so that Company, our dependent variable, appears in the last column of the data frame.
order<- df2[,c(6,2,3,4,5,1)]
order$Company <- as.factor(as.character(order$Company))
First, we will test the assumption of homogeneity of covariance matrices, which is required to use LDA as a classification method. We can test for homogeneity using Box's M test.
The hypotheses for the test are as follows:
\[ \begin{aligned} H_0&:\mbox{The observed covariance matrices for the dependent variables are equal across groups}\\ H_a&:\mbox{The observed covariance matrices for the dependent variables are NOT equal across groups} \end{aligned} \]
library(heplots)
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:VGAM':
##
## logit
## Loading required package: broom
boxM(order[,1:5], order$Company)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: order[, 1:5]
## Chi-Sq (approx.) = 509.88, df = 45, p-value < 2.2e-16
Since the p-value is very small, we reject the null hypothesis in favor of the alternative, indicating heterogeneity of covariance matrices. The assumption of homogeneity of covariance matrices for LDA is therefore not met.
We can try to normalize the data to see if it has an impact on the result.
order2 <- order %>% mutate_at(c("Rating","Price","No_of_ratings","TotalReviwes","RomSize_inGB"), ~(scale(.) %>% as.vector))
order3 = order2
library(heplots)
boxM(order2[,1:5], order$Company)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: order2[, 1:5]
## Chi-Sq (approx.) = 509.88, df = 45, p-value < 2.2e-16
Normalization using the scale function has no effect on the homogeneity of the covariance matrices; Box's M statistic is unchanged because rescaling the variables rescales every group's covariance matrix in the same way.
The covariance matrices do not need to be homogeneous to use QDA, but one of the assumptions of QDA is a Gaussian distribution of the independent variables.
order2 = order[,1:5]
order2 <- as.matrix(order2[,1:5])
We can visualize the distributions of the independent variables using a histogram of each.
par(mfrow = c(1, ncol(order2)))
for(j in 1:ncol(order2)){
hist(order2[,j])}
From the histograms, Price, number of ratings and the total reviews do not appear to follow a normal distribution. We can use the Shapiro-Wilk normality test to validate these results.
Below is the null hypothesis for the test.
\[ \begin{aligned} H_0&:\mbox{The data follow a Gaussian distribution}\\ H_a&:\mbox{The data does NOT follow a gaussian distribution} \end{aligned} \]
library(mvnormtest)
mshapiro.test(t(order2))
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.83086, p-value = 7.943e-12
With a p-value of 7.943e-12, we reject the null hypothesis in favor of the alternative, indicating that the independent variables do not follow a multivariate normal distribution.
We also need to test to see if multicollinearity is present in our predictors.
faraway::vif(order2)
## RomSize_inGB Price Rating No_of_ratings TotalReviwes
## 1.497726 3.018487 3.426773 31.038830 28.122784
It appears we need to remove No_of_ratings from our model: its VIF (31.0) is well above 10, indicating multicollinearity.
order3 <- order3[, -4]
order2 <- order2[, -4]
faraway::vif(order2)
## RomSize_inGB Price Rating TotalReviwes
## 1.495061 2.973678 2.915072 1.452722
While we have not met all the assumptions for LDA and QDA, we will attempt classification with these methods regardless and use the misclassification rates as a metric of their performance.
library(MASS)
library(ISLR)
set.seed(10)
sample <- sample(c(TRUE, FALSE), nrow(order3), replace=TRUE, prob=c(0.7,0.30))
train <- order3[sample, ]
test <- order3[!sample, ] # sample is logical, so !sample (not -sample) selects the held-out rows
qda.fit<-qda(Company~Rating+Price+TotalReviwes+RomSize_inGB, data = train)
lda.fit<-lda(Company~Rating+Price+TotalReviwes+RomSize_inGB, data = train)
# Testing the LDA classifier with the test data
lda.pred=predict(lda.fit, test)
table(lda.pred$class, test$Company)
##
## APPLE IQOO SAMSUNG VIVO
## APPLE 46 0 0 0
## IQOO 0 0 0 0
## SAMSUNG 0 7 47 14
## VIVO 0 1 13 20
mean(lda.pred$class != test$Company)
## [1] 0.2364865
The misclassification rate with LDA is 0.2364865.
#Testing the QDA classifier with the test data
qda.pred=predict(qda.fit, test)
table(qda.pred$class, test$Company)
##
## APPLE IQOO SAMSUNG VIVO
## APPLE 46 0 0 0
## IQOO 0 5 1 0
## SAMSUNG 0 3 40 9
## VIVO 0 0 19 25
mean(qda.pred$class != test$Company)
## [1] 0.2162162
The misclassification rate with QDA is 0.2162162.
We can also visualize how LDA and QDA separate the data using klaR's partimat function. With this tool it is possible to see how each of the independent variables categorizes the company. It is also worth noting that LDA draws its boundaries as straight lines, whereas the QDA regions are curved, hence 'quadratic'.
Below is the LDA partition plot.
library(klaR)
partimat(Company ~ Rating+Price+TotalReviwes+RomSize_inGB, data = order3,
method = "lda", nplots.vert = 1)
Below is the QDA partition plot.
partimat(Company ~ Rating+Price+TotalReviwes+RomSize_inGB, data = order3,
method = "qda", nplots.vert = 1)
Recall that normality was not achieved for Price and TotalReviwes, so QDA will be tested again after removing these predictors to see if the result improves.
# Removal of the non-normal predictors
order4 <- order3[, (names(order3) %in% c("Rating","RomSize_inGB", "Company"))]
Building the training and test data for the QDA classifier with only rating and RomSize_inGB as independent variables.
set.seed(10)
sample <- sample(c(TRUE, FALSE), nrow(order4), replace=TRUE, prob=c(0.7,0.30))
trainQ <- order4[sample, ]
testQ <- order4[!sample, ] # logical indexing: use !sample, not -sample
qda.fit<-qda(Company~Rating+RomSize_inGB, data = trainQ)
Applying the QDA classification model to the test set.
qda.pred=predict(qda.fit, testQ)
table(qda.pred$class, testQ$Company)
##
## APPLE IQOO SAMSUNG VIVO
## APPLE 44 0 1 2
## IQOO 0 1 0 0
## SAMSUNG 0 7 48 13
## VIVO 2 0 11 19
mean(qda.pred$class != testQ$Company)
## [1] 0.2432432
We can see that the misclassification rate did not improve; it increased slightly to 0.2432432. Therefore, Price and TotalReviwes will be kept among the predictors used in QDA.
Moving on from LDA and QDA, we can check whether a multinomial logistic regression classifier improves classification accuracy. We will use the same predictors as LDA and QDA for a fair comparison of misclassification rates.
require(foreign)
## Loading required package: foreign
require(nnet)
## Loading required package: nnet
require(ggplot2)
require(reshape2)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(VGAM)
Model.fit<-multinom(Company~Rating+Price+TotalReviwes+RomSize_inGB, data = train)
## # weights: 24 (15 variable)
## initial value 153.878674
## iter 10 value 50.014668
## iter 20 value 41.385169
## iter 30 value 39.045159
## iter 40 value 38.974093
## iter 50 value 38.974050
## iter 60 value 38.969317
## iter 60 value 38.969317
## iter 60 value 38.969317
## final value 38.969317
## converged
Prob.predict<-predict(Model.fit, test)
Actual<-test$Company
table(Prob.predict, Actual)
## Actual
## Prob.predict APPLE IQOO SAMSUNG VIVO
## APPLE 46 0 0 0
## IQOO 0 5 0 0
## SAMSUNG 0 3 52 16
## VIVO 0 0 8 18
mean(Prob.predict != Actual)
## [1] 0.1824324
We can see that logistic regression worked better than both LDA and QDA, with a misclassification rate of 0.1824324. This improved performance may be because logistic regression neither requires homogeneous covariance matrices (which LDA does, and which was not met here) nor assumes a Gaussian distribution of the predictors (as QDA does).
We will use Naive Bayes to see how it compares to LDA, QDA as well as logistic regression. Included below are the plots to show how the Naive Bayes Classifier works with the density of each of the different companies for a given variable.
library(naivebayes)
## naivebayes 0.9.7 loaded
set.seed(10)
sample2 <- sample(c(TRUE, FALSE), nrow(order3), replace=TRUE, prob=c(0.7,0.30))
train2 <- order3[sample2, ]
test2 <- order3[!sample2, ] # logical indexing with the sample2 vector
modelbayes <- naive_bayes(Company~Rating+Price+TotalReviwes+RomSize_inGB, data = train2, usekernel = T)
plot(modelbayes)
Now we use Naive Bayes to predict on the test set.
pred <- predict(modelbayes, test2)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
(tab1 <- table(pred, test2$Company))
##
## pred APPLE IQOO SAMSUNG VIVO
## APPLE 46 0 0 0
## IQOO 0 5 9 0
## SAMSUNG 0 3 41 9
## VIVO 0 0 10 25
mean(pred != test2$Company)
## [1] 0.2094595
We can see that the misclassification rate using Naive Bayes is 0.2094595 on the test set. This is less accurate than multinomial logistic regression but better than LDA and QDA.
Why did Naive Bayes perform worse than multinomial logistic regression? One reason may be its feature-independence assumption. We can check the correlation between the variables.
print(cor(cellphone[, c('Rating','Price','TotalReviwes','RomSize_inGB')]))
## Rating Price TotalReviwes RomSize_inGB
## Rating 1.0000000 0.70804693 0.42640871 0.37322466
## Price 0.7080469 1.00000000 0.04769482 0.57338900
## TotalReviwes 0.4264087 0.04769482 1.00000000 0.01339623
## RomSize_inGB 0.3732247 0.57338900 0.01339623 1.00000000
Clearly there is correlation between the independent variables used in the categorization of the phones, so it may not be appropriate to use Naive Bayes for classification.
We can try removing Rating from the predictors, since it has a high correlation with the others.
order5 <- order3[, (names(order3) %in% c('Price','TotalReviwes','RomSize_inGB', "Company"))]
set.seed(10)
sample3 <- sample(c(TRUE, FALSE), nrow(order5), replace=TRUE, prob=c(0.7,0.30))
train3 <- order5[sample3, ]
test3 <- order5[!sample3, ] # logical indexing with the sample3 vector
modelbayes2 <- naive_bayes(Company~Price+TotalReviwes+RomSize_inGB, data = train3, usekernel = T)
Now we use Naive Bayes to predict on the test set.
pred2 <- predict(modelbayes2, test3)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
(tab2 <- table(pred2, test3$Company))
##
## pred2 APPLE IQOO SAMSUNG VIVO
## APPLE 43 0 0 0
## IQOO 3 6 10 0
## SAMSUNG 0 2 48 15
## VIVO 0 0 2 19
mean(pred2 != test3$Company)
## [1] 0.2162162
The misclassification rate increased slightly when Rating was removed as a predictor. For this reason, we will keep Rating in the classifier.
We can also look at KNN, a non-parametric classifier that makes no assumptions about the underlying distribution of the data.
We will perform multiple runs of KNN, varying the number of nearest neighbours k on each run, to determine the most accurate classifier.
# Loading package
library(e1071)
library(caTools)
library(class)
# Fitting the Knn Model
classifier_knn <- knn(train = train[,1:4],
test = test[,1:4],
cl = train$Company,
k = 11)
# Confusion Matrix
cm <- table(test$Company, classifier_knn)
cm
## classifier_knn
## APPLE IQOO SAMSUNG VIVO
## APPLE 46 0 0 0
## IQOO 0 0 7 1
## SAMSUNG 2 0 45 13
## VIVO 3 0 15 16
# Calculate out of Sample error
misClassError <- mean(classifier_knn != test$Company)
misClassError
## [1] 0.277027
# Fitting the Knn Model
classifier_knn <- knn(train = train[,1:4],
test = test[,1:4],
cl = train$Company,
k = 9)
# Confusion Matrix
cm <- table(test$Company, classifier_knn)
cm
## classifier_knn
## APPLE IQOO SAMSUNG VIVO
## APPLE 46 0 0 0
## IQOO 0 1 7 0
## SAMSUNG 2 0 48 10
## VIVO 0 0 16 18
# Calculate out of Sample error
misClassError <- mean(classifier_knn != test$Company)
misClassError
## [1] 0.2364865
# Fitting the Knn Model
classifier_knn <- knn(train = train[,1:4],
test = test[,1:4],
cl = train$Company,
k = 7)
# Confusion Matrix
cm <- table(test$Company, classifier_knn)
cm
## classifier_knn
## APPLE IQOO SAMSUNG VIVO
## APPLE 46 0 0 0
## IQOO 0 1 7 0
## SAMSUNG 1 0 53 6
## VIVO 0 0 12 22
# Calculate out of Sample error
misClassError <- mean(classifier_knn != test$Company)
misClassError
## [1] 0.1756757
# Fitting the Knn Model
classifier_knn <- knn(train = train[,1:4],
test = test[,1:4],
cl = train$Company,
k = 5)
# Confusion Matrix
cm <- table(test$Company, classifier_knn)
cm
## classifier_knn
## APPLE IQOO SAMSUNG VIVO
## APPLE 46 0 0 0
## IQOO 1 2 5 0
## SAMSUNG 0 0 49 11
## VIVO 0 1 7 26
# Calculate out of Sample error
misClassError <- mean(classifier_knn != test$Company)
misClassError
## [1] 0.1689189
# Fitting the Knn Model
classifier_knn <- knn(train = train[,1:4],
test = test[,1:4],
cl = train$Company,
k = 3)
# Confusion Matrix
cm <- table(test$Company, classifier_knn)
cm
## classifier_knn
## APPLE IQOO SAMSUNG VIVO
## APPLE 46 0 0 0
## IQOO 1 2 5 0
## SAMSUNG 0 0 49 11
## VIVO 0 1 3 30
# Calculate out of Sample error
misClassError <- mean(classifier_knn != test$Company)
misClassError
## [1] 0.1418919
We can see from our results that the most accurate classifier is obtained with k = 3, with a misclassification rate of 0.1418919.
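The five runs above differ only in k, so they can be condensed into one loop. A self-contained sketch of the pattern, using the built-in iris data since the report's train and test objects live in earlier chunks:

```r
library(class)  # knn()

set.seed(10)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
tr    <- iris[idx, 1:4];    te    <- iris[-idx, 1:4]
cl_tr <- iris$Species[idx]; cl_te <- iris$Species[-idx]

# Misclassification rate for each candidate k
ks  <- c(3, 5, 7, 9, 11)
err <- sapply(ks, function(k) mean(knn(tr, te, cl_tr, k = k) != cl_te))
data.frame(k = ks, misclassification = err)
ks[which.min(err)]  # the best k for this split
```

Swapping in this report's train[,1:4], test[,1:4] and Company labels would reproduce the comparison above in a few lines.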
In conclusion, we have evaluated five different classifiers, K-nearest neighbors (KNN), Naive Bayes, multinomial logistic regression, Quadratic Discriminant Analysis (QDA), and Linear Discriminant Analysis (LDA), on a dataset that did not follow a Gaussian distribution and had unequal covariance matrices as well as correlation between predictors. Our results show that KNN had the lowest misclassification rate at about 0.14, followed by logistic regression at 0.18, Naive Bayes and QDA at about 0.21, and LDA at 0.24. These findings suggest that KNN and logistic regression may be effective classifiers for data that is non-Gaussian with unequal covariance matrices, while QDA, Naive Bayes, and LDA may not perform as well.
We should emphasize that this part of the analysis is not particularly useful for a dataset with fewer than 10 predictors like ours. It was still carried out for its value as a learning exercise; this point was brought up and addressed during our presentation. With that in mind, let's go through the motions.
PCA is a tool that projects a collection of vectors into a new vector space of fewer orthogonal dimensions. This projection can be simpler to fit using a linear model but much harder to interpret: the new predictors are linear combinations of the original predictors and may not have a direct interpretation. To carry out a PCA one must have correlation among the features; otherwise there is little redundancy to compress. PCA is also sensitive to outliers and scale: if our numerical predictors have different units or span different orders of magnitude, we need to re-scale them. For every predictor, this is accomplished by subtracting its mean and dividing by its standard deviation. In Python, the library scikit-learn supports different types of scaling out of the box, and it can also perform PCA:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
df_scaled = StandardScaler().fit_transform(df)  # df: the numeric predictors, assumed loaded earlier
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df_scaled)
The following figure summarizes the results.
We can see that about 70% of the variance can be explained by projecting onto a 2D space. To use this in a prediction scenario we would have to set up a pipeline that follows this workflow.
This is less of a hassle when dealing with many predictors, where the gains from dimensionality reduction are tangible.
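For reference, the same scale-then-project step can be sketched in R (the language used in the rest of this report) with prcomp; the iris columns below stand in for our numeric predictors:

```r
# Standardize each predictor, then project onto 2 principal components
num <- scale(iris[, 1:4])   # subtract the mean, divide by the sd
pc  <- prcomp(num)          # PCA on the standardized data
scores <- pc$x[, 1:2]       # the 2D projection used downstream
# Proportion of variance captured by the first two components
cum2 <- summary(pc)$importance["Cumulative Proportion", 2]
cum2
```

Equivalently, prcomp(df, center = TRUE, scale. = TRUE) performs the standardization internally.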
Now that we have a better understanding of the data, we can start building a model to predict the price of a mobile phone. First, we will build a Multiple Linear Regression model to see whether there are linear relationships between Price and the predictor variables. Before we start, here are the steps we will take:
1. Split the data into training and test sets.
2. Explore collinearity between the variables.
3. Build a base valid model.
4. If there is a noticeable pattern in the residual plot, introduce interaction terms and higher-order terms.
5. Check the assumptions of the model.
6. If any of the assumptions are violated, try to fix them with transformations and outlier removal.
At each stage of the model building process, we will check the performance of the model using the test set.
In case the multiple linear regression model does not perform well, or the relationship is non-linear we will build a Regression Tree that can model more complex relationships between the variables.
As our initial step, we will use the following variables as predictors: Rating, No_of_ratings, TotalReviwes, RamSize_inGB, RomSize_inGB, Is_5G. We will not use Company as a predictor because it is a categorical variable with many levels, which would make the model more complex. We will use Price as the response variable. Again, we will use the train data set to build the model and the test data set to evaluate it.
# Split the data into training and test sets
set.seed(2023)
train_index <- sample(1:nrow(mobile_dataset_mlr), 0.8 * nrow(mobile_dataset_mlr))
train_data <- mobile_dataset_mlr[train_index, ]
test_data <- mobile_dataset_mlr[-train_index, ]
Let’s explore the relationship between the variables using the correlation matrix.
# Explore collinearity between the variables
# Plot the correlation matrix with the help of the ggally package
library('GGally')
library('ggplot2')
ggpairs(train_data, lower = list(continuous = "smooth_loess", combo = "facethist", discrete = "facetbar", na = "na")) +
scale_color_gradient(high = "red", low = "blue")
Let’s apply the VIF test to confirm whether there is collinearity between the variables.
# Check the variance inflation factor for each coefficient using the VIF function from the package "regclass"
library('regclass')
# Fit the full model with all candidate predictors so we can compute VIFs
mlr_full_model <- lm(Price ~ Rating + No_of_ratings + TotalReviwes + RamSize_inGB + RomSize_inGB + factor(Is_5G), data = train_data)
VIF(mlr_full_model)
## Rating No_of_ratings TotalReviwes RamSize_inGB RomSize_inGB
## 1.585407 14.057267 13.412710 2.065303 2.328033
## factor(Is_5G)
## 1.320028
From the VIF values, we can see that two variables are collinear (VIF > 10). From the correlation matrix, the correlation between these variables is also high (96%). So, we will remove the variable with the highest VIF value, that is No_of_ratings. Then we update the model by removing No_of_ratings and check the significance of the remaining variables.
# Update the model by removing the variable with the highest VIF value
mlr_updated_model <- lm(Price ~ Rating + TotalReviwes + RamSize_inGB + RomSize_inGB + factor(Is_5G), data = train_data)
summary(mlr_updated_model)
##
## Call:
## lm(formula = Price ~ Rating + TotalReviwes + RamSize_inGB + RomSize_inGB +
## factor(Is_5G), data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34720 -4432 -1312 2274 88499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.894e+04 1.064e+04 -7.421 4.21e-13 ***
## Rating 2.052e+04 2.600e+03 7.890 1.54e-14 ***
## TotalReviwes -4.146e-01 1.197e-01 -3.464 0.000572 ***
## RamSize_inGB -3.154e+03 2.252e+02 -14.006 < 2e-16 ***
## RomSize_inGB 2.650e+02 1.154e+01 22.965 < 2e-16 ***
## factor(Is_5G)Yes 2.773e+02 1.239e+03 0.224 0.822978
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10990 on 574 degrees of freedom
## Multiple R-squared: 0.6359, Adjusted R-squared: 0.6327
## F-statistic: 200.5 on 5 and 574 DF, p-value: < 2.2e-16
From the individual t-test, we can see that the variable Is_5G is not significant, with a p-value of 0.82. Therefore, we will remove it from the model.
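The refit without Is_5G is not shown in the chunk above; update() is the idiomatic way to drop a term. A minimal self-contained illustration with the built-in mtcars data (in this report the call would be update(mlr_updated_model, . ~ . - factor(Is_5G)), an assumption since the refit is not printed):

```r
# Fit a model with a factor term, then drop that term from the formula
fit_full    <- lm(mpg ~ wt + hp + factor(am), data = mtcars)
fit_reduced <- update(fit_full, . ~ . - factor(am))  # remove the factor term
summary(fit_reduced)$adj.r.squared  # adjusted R-squared of the reduced fit
```

update() refits on the same data with the modified formula, so the reduced model stays comparable to the full one.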
The adjusted R-squared value is 0.6334, meaning the model explains about 63.3% of the variance in the data.
Now, all the variables are significant. Therefore, the model is valid and can be used for prediction.
# Predict the price of the test data
test_data$predicted_price_mlr_base <- predict(mlr_updated_model, test_data)
# Negative values are not possible for the price of a mobile phone. Therefore, we will replace the negative values with 0.
test_data$predicted_price_mlr_base[test_data$predicted_price_mlr_base < 0] <- 0
# Calculate the root mean square error (RMSE)
rmse1 <- sqrt(mean((test_data$Price - test_data$predicted_price_mlr_base)^2))
The RMSE value is about 10,682, which is high. Therefore, we will try to improve the model by introducing interaction terms and higher-order terms.
Let’s examine the residual diagnostics to see how well the model fits the data.
# Check assumptions of the model
# Scale-Location plot: spread of standardized residuals vs fitted values
plot(mlr_updated_model, which = 3)
The residual plot shows that the model is not a good fit: the residuals are not randomly distributed around the horizontal axis and there is a clear pattern. Therefore, we will try to improve the model by introducing interaction terms and higher-order terms.
Let’s improve the model by trying interaction terms.
# Build a multiple linear regression model with interaction terms
mlr_interaction_model <- lm(Price ~ Rating + TotalReviwes + RamSize_inGB + RomSize_inGB + Rating:TotalReviwes + Rating:RamSize_inGB + Rating:RomSize_inGB + TotalReviwes:RamSize_inGB + TotalReviwes:RomSize_inGB + RamSize_inGB:RomSize_inGB, data = train_data)
summary(mlr_interaction_model)
##
## Call:
## lm(formula = Price ~ Rating + TotalReviwes + RamSize_inGB + RomSize_inGB +
## Rating:TotalReviwes + Rating:RamSize_inGB + Rating:RomSize_inGB +
## TotalReviwes:RamSize_inGB + TotalReviwes:RomSize_inGB + RamSize_inGB:RomSize_inGB,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37220 -3142 -859 1480 67354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.099e+04 1.243e+04 -6.513 1.62e-10 ***
## Rating 2.070e+04 3.099e+03 6.678 5.76e-11 ***
## TotalReviwes -1.671e+00 3.310e+00 -0.505 0.6138
## RamSize_inGB 9.065e+04 5.214e+03 17.386 < 2e-16 ***
## RomSize_inGB -4.480e+03 2.510e+02 -17.844 < 2e-16 ***
## Rating:TotalReviwes 4.452e-01 7.479e-01 0.595 0.5519
## Rating:RamSize_inGB -2.109e+04 1.202e+03 -17.547 < 2e-16 ***
## Rating:RomSize_inGB 1.048e+03 5.508e+01 19.029 < 2e-16 ***
## TotalReviwes:RamSize_inGB 8.202e-02 4.804e-02 1.707 0.0884 .
## TotalReviwes:RomSize_inGB -1.059e-02 2.020e-03 -5.242 2.24e-07 ***
## RamSize_inGB:RomSize_inGB 1.056e+01 2.061e+00 5.124 4.09e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8150 on 569 degrees of freedom
## Multiple R-squared: 0.8016, Adjusted R-squared: 0.7981
## F-statistic: 229.9 on 10 and 569 DF, p-value: < 2.2e-16
Let’s remove the interaction terms with insignificant p-values and build the model again. Although TotalReviwes has a p-value higher than 0.05, we keep it in the model because it is part of the significant interaction term TotalReviwes:RomSize_inGB.
# Update the model by removing the variable with the insignificant p-value
mlr_interaction_updated_model <- lm(Price ~ Rating + TotalReviwes + RamSize_inGB + RomSize_inGB + Rating:RamSize_inGB + Rating:RomSize_inGB + TotalReviwes:RomSize_inGB + RamSize_inGB:RomSize_inGB, data = train_data)
The adjusted R-squared value is 0.7978, meaning the model explains about 79.8% of the variance in the data.
Now, all the variables are significant. Therefore, the model is valid and can be used for prediction.
# Predict the price of the test data
test_data$predicted_price_mlr_interaction <- predict(mlr_interaction_updated_model, test_data)
# Negative values are not possible for the price of a mobile phone. Therefore, we will replace the negative values with 0.
test_data$predicted_price_mlr_interaction[test_data$predicted_price_mlr_interaction < 0] <- 0
# Calculate the root mean square error (RMSE)
rmse2 <- sqrt(mean((test_data$Price - test_data$predicted_price_mlr_interaction)^2))
# Print the mean square error
print(paste("RMSE:", rmse2))
## [1] "RMSE: 11615.5191247919"
We can see that the RMSE value is higher than for the base model, likely because the model is overfitting the training data. Let’s try to improve the model by adding polynomial terms.
We start by adding a polynomial term to the variable most highly correlated with the target, that is Rating. We start with a quadratic term and then try higher orders, stopping when some of the terms become insignificant. It turned out that the cubic term gives the best fit. Here is the summary of the model.
# Build polynomial regression model
mlr_poly_model <- lm(Price ~ poly(Rating, 3) + TotalReviwes + RamSize_inGB + RomSize_inGB + Rating:RamSize_inGB + Rating:RomSize_inGB + TotalReviwes:RomSize_inGB + RamSize_inGB:RomSize_inGB, data = train_data)
The adjusted R-squared value is 0.8161, meaning the model explains about 81.6% of the variance in the data. That is a good fit, but let’s apply the model to the test data and see how well it performs.
# Predict the price of the test data
test_data$predicted_price_mlr_poly <- predict(mlr_poly_model, test_data)
# Negative values are not possible for the price of a mobile phone. Therefore, we will replace the negative values with 0.
test_data$predicted_price_mlr_poly[test_data$predicted_price_mlr_poly < 0] <- 0
# Calculate the root mean square error (RMSE)
rmse3 <- sqrt(mean((test_data$Price - test_data$predicted_price_mlr_poly)^2))
The RMSE value (about 10,146) is slightly improved compared to the base model. Let’s check the assumptions of the model.
Homoscedasticity - constant variance of the residuals. The residuals should be randomly distributed around the horizontal axis.
# Check assumptions of the model
# Scale-Location plot: spread of standardized residuals vs fitted values
plot(mlr_poly_model, which = 3)
The residuals are not randomly distributed around the horizontal axis; there is a clear pattern in the plot. Therefore, we will try to improve the model by transforming the target variable. But first, let’s cross-check homoscedasticity with the Breusch-Pagan test, and later check the other assumptions of the model as well.
library(lmtest)
bptest(mlr_poly_model)
##
## studentized Breusch-Pagan test
##
## data: mlr_poly_model
## BP = 106.01, df = 10, p-value < 2.2e-16
The output shows the studentized Breusch-Pagan test for the polynomial MLR model. The p-value (&lt; 2.2e-16) is far below 0.05, so we reject the null hypothesis of constant variance. The test therefore provides evidence that heteroscedasticity (non-constant variance) is present.
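For intuition, the studentized Breusch-Pagan statistic that bptest() reports can be reproduced by hand: regress the squared residuals on the model’s predictors and compute n times the R-squared of that auxiliary regression. A minimal base-R sketch on deliberately heteroscedastic toy data (not the chapter’s model):

```r
# Manual studentized Breusch-Pagan test on toy heteroscedastic data
set.seed(42)
n <- 300
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)          # error variance grows with x
fit <- lm(y ~ x)

u2 <- residuals(fit)^2                     # squared residuals
aux <- lm(u2 ~ x)                          # auxiliary regression
bp_stat <- n * summary(aux)$r.squared      # LM statistic: n * R^2
p_val <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(BP = bp_stat, p.value = p_val))
```

A small p-value rejects the null of constant variance, matching the conclusion bptest() gives for the polynomial model above.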
Normality - the residuals should be normally distributed.
# Normality test
par(mfrow=c(1,2))
hist(residuals(mlr_poly_model), breaks = 24)
plot(mlr_poly_model, which=2) #a Normal plot
The residuals are not normally distributed: the QQ-plot shows large deviations from the reference line. Let’s cross-check the normality of the residuals with the Shapiro-Wilk test.
H0: the residuals are normally distributed. Ha: the residuals are not normally distributed.
shapiro.test(residuals(mlr_poly_model))
##
## Shapiro-Wilk normality test
##
## data: residuals(mlr_poly_model)
## W = 0.7283, p-value < 2.2e-16
The p-value (&lt; 2.2e-16) is far below 0.05, so we reject the null hypothesis: the residuals are not normally distributed.
Linearity - the relationship between the independent variables and the dependent variable should be linear.
ggplot(mlr_poly_model, aes(x=.fitted, y=.resid)) +
geom_point() + geom_smooth()+
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The residuals deviate more and more from the horizontal axis as the fitted values increase. Therefore, the relationship between the independent variables and the dependent variable is not linear.
Therefore, we will try to fix the assumptions by transforming the target variable, using the Box-Cox transformation.

### Box-Cox transformation
library("MASS")
bc = boxcox(mlr_poly_model,lambda=seq(-1,1))
bestlambda = bc$x[which.max(bc$y)]
mlr_bcmodel = lm((((Price^bestlambda)-1)/bestlambda) ~ poly(Rating, 3) + TotalReviwes + RamSize_inGB + RomSize_inGB + Rating:RamSize_inGB + Rating:RomSize_inGB + TotalReviwes:RomSize_inGB + RamSize_inGB:RomSize_inGB, data=train_data)
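One practical detail: mlr_bcmodel predicts on the transformed scale, so its predictions must be back-transformed before computing an RMSE comparable with the earlier models. A self-contained sketch of the forward and inverse transform on toy log-normal data (not the chapter’s dataset):

```r
library(MASS)

# Toy positive, right-skewed response
set.seed(1)
x <- runif(200, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(200, sd = 0.2))
fit <- lm(y ~ x)

bc <- boxcox(fit, lambda = seq(-1, 1, 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Forward transform (lambda near 0 degenerates to log)
z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
fit_bc <- lm(z ~ x)

# Back-transform a prediction to the original scale
pred_z <- predict(fit_bc, newdata = data.frame(x = 5))
pred_y <- if (abs(lambda) < 1e-8) exp(pred_z) else (lambda * pred_z + 1)^(1 / lambda)
```

Note that naively back-transforming in this way yields roughly a median rather than a mean prediction, so a small retransformation bias remains.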
After the transformation and after removing outliers, we still weren’t able to satisfy the assumptions of the model (code omitted for the sake of brevity). Even though the adjusted R-squared value is 0.816, we can’t use the model for prediction.
All steps of the model building process are summarized in the following diagram.
Therefore, we will try Regression Tree to model more complex relationships between the variables. And as our final step in this chapter, we will compare the RMSE values of the models using cross-validation.
# Create a new data frame with the variables of interest
column_names_tree <- c("Price", "Rating", "No_of_ratings", "TotalReviwes", "RamSize_inGB", "RomSize_inGB", "Is_5G", "Company")
mobile_dataset_tree <- mobile_dataset[column_names_tree]
# Remove missing values from the dataset
mobile_dataset_tree <- mobile_dataset_tree %>%
filter(!is.na(Price))
# Split the data into train and test sets
set.seed(2023)
# Use a single set of sampled indices so the test set is the exact complement of the train set
train_index <- sample(1:nrow(mobile_dataset_tree), 0.7*nrow(mobile_dataset_tree))
train_data <- mobile_dataset_tree[train_index, ]
test_data <- mobile_dataset_tree[-train_index, ]
# Fit a regression tree to the data
library(tree)
tree_model <- tree(Price ~ ., data = train_data)
## Warning in tree(Price ~ ., data = train_data): NAs introduced by coercion
summary(tree_model)
##
## Regression tree:
## tree(formula = Price ~ ., data = train_data)
## Variables actually used in tree construction:
## [1] "RamSize_inGB" "No_of_ratings" "RomSize_inGB"
## Number of terminal nodes: 8
## Residual mean deviance: 38980000 = 1.945e+10 / 499
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -26340.0 -2831.0 -207.4 0.0 1589.0 40560.0
Our initial tree has 8 terminal nodes.
# Plot the tree
plot(tree_model)
text(tree_model, pretty = 0)
# Apply the tree to the test set
tree_pred <- predict(tree_model, test_data)
# Calculate the RMSE
rmse5 <- sqrt(mean((tree_pred - test_data$Price)^2))
The RMSE value is 5573.71. Let’s prune the tree. To do that, we will first check the plot of the cross-validation error against the size of the tree.
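The size-selection plot mentioned above can be produced with cv.tree() from the tree package. Here is a hedged sketch on synthetic data (the chapter applies the same calls to tree_model):

```r
library(tree)

# Toy data with one strong split so the tree has an obvious "right" size
set.seed(7)
df <- data.frame(x1 = runif(500), x2 = runif(500))
df$y <- ifelse(df$x1 > 0.5, 10, 2) + rnorm(500)

fit <- tree(y ~ x1 + x2, data = df)
cv  <- cv.tree(fit)                        # 10-fold cross-validation by default
plot(cv$size, cv$dev, type = "b",
     xlab = "Number of terminal nodes", ylab = "CV deviance")

best_size <- cv$size[which.min(cv$dev)]    # size with the lowest CV error
pruned <- prune.tree(fit, best = best_size)
```

The elbow of this curve (or the size minimizing CV deviance) is the natural choice for the `best` argument of prune.tree().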
We will use 6 as the best number of terminal nodes. Let’s prune the tree and calculate the RMSE.
# Prune the tree
pruned_tree_model <- prune.tree(tree_model, best = 6)
# Plot the pruned tree
plot(pruned_tree_model)
text(pruned_tree_model, pretty = 0)
The RMSE after pruning is 6104.23.
Let’s create 10 folds and calculate the cross-validation error for both models.
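The fold machinery can be written in a few lines of base R (the chapter loads caret, whose createFolds() serves the same purpose). A hedged sketch with a stand-in linear model and toy data:

```r
# Base-R 10-fold cross-validation sketch (toy data, stand-in model)
set.seed(2023)
n <- 200
dat <- data.frame(x = runif(n, 1, 10))
dat$y <- 5 + 2 * dat$x + rnorm(n)

k <- 10
fold <- sample(rep(1:k, length.out = n))   # assign each row to one of k folds

cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = dat[fold != i, ])          # train on k-1 folds
  pred <- predict(fit, newdata = dat[fold == i, ])    # predict held-out fold
  sqrt(mean((dat$y[fold == i] - pred)^2))
})
mean(cv_rmse)                              # cross-validation RMSE
```

The same loop, with lm() swapped for the polynomial regression or the tree model, yields a cross-validation error for each competing model.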
Cross-validation error for our best linear regression model is 7679.43.
Cross-validation error for our regression tree model is 6433.61.
In this chapter, we have built a linear regression model and a regression tree model to predict the price of a mobile phone, using the RMSE value and cross-validation to compare them. The RMSE for the linear regression model is 10145.56 and the RMSE for the regression tree model is 6104.23, so the regression tree model outperforms the linear regression model. Moreover, the multiple linear regression model fails to meet its assumptions, while the regression tree imposes no such distributional assumptions. Therefore, we recommend using the regression tree model for prediction.
2020 review. “High-End Mobile Phones Price Have Soared 490% in 20 Years.” This Is Money, 23 July 2020, https://www.thisismoney.co.uk/money/bills/article-8548235/High-end-mobile-phones-price-soared-490-20-years.html.
“MobilePhone’s Dataset.” Kaggle, 20 Dec. 2022, https://www.kaggle.com/datasets/sudhanshuy17/mobilephone.
Schmidt, Pascal. “Assumption Checking of LDA vs. QDA – R Tutorial Pima Indians Data Set.” Retrieved 17 Feb. 2023, https://thatdatatho.com/assumption-checking-lda-vs-qda-r-tutorial-2/.